ABSTRACT


In the era of industrial digitalization, organizations are increasingly investing in
solutions that support data collection, data analysis and performance
improvement. This paper considers web-scale knowledge extraction and
alignment that integrates a few sources and explores different methods of
aggregation and attention, with a focus on image information. The main aim of
data extraction from semi-structured data is to retrieve useful information
from the web. Data in the deep web is retrievable, but only through form
submission, because search engines cannot reach it. As HTML documents grow
larger, the data extraction process has been plagued by lengthy processing
times.


Web data extraction 
Wrapper extraction of image 


In this research work, we propose an improved model, wrapper extraction of
image using document object model (DOM) and JavaScript object notation data
(JSON) (WEIDJ), in response to promising results in mining a higher volume of
images in various formats. To assess the efficiency of WEIDJ, we compare its
data extraction performance at different levels of page extraction with VIBS,
MDR, DEPTA and VIDE. WEIDJ yielded the best results, with a precision of 100%,
a recall of 97.93103% and an F-measure of 98.9547%.
1. INTRODUCTION

The number of devices and gadgets connected to the Internet is on the rise. This growth in
connectivity makes the web the largest source of information worldwide. With the large amount
of data residing on the web, complemented by advanced database processing technologies, gathering,
collecting and processing the data has become far easier. As a consequence of the exponential growth of
data, it is important for users to adopt advanced data analytics technologies for efficient storage,
retrieval and analysis. The main aim is to put this data to good use by learning about patterns and
trends that can make a positive impact on our lifestyle. However, the data itself does not achieve
these objectives; rather, the solutions arise from analyzing it and finding the answers we need. This
accumulation of data, in terms of volume, technology and techniques, is often discussed in relation to
mining data from the World Wide Web. Figure 1 shows the number of scholarly works over time by
publication type, such as book, dissertation, journal article, report and conference proceeding, according
to lens.org. The graph makes the trend in this research field easy to see.
Figure 1. Number of scholarly works from 1970 to 2020


Data mining uncovers new facts and relationships using useful patterns and techniques in order to
provide solutions for handling big data [1]. Data mining techniques are implemented to find useful patterns
in large databases managed by systems such as MySQL and Oracle. It is a process that tries to discover
patterns or techniques that can be applied to large datasets [2]. The main goal of data mining is to extract
information from large datasets. Sufficient data and supporting tools are both important and need to
complement each other when dealing with large datasets. Data mining may leverage big data
implementations, which provide great opportunities in various fields such as e-commerce, industrial control
and smart medical care [3]. However, the characteristics of large volume, variety, velocity and veracity of
information need to be considered in order to handle the challenges of data mining [4]. Finally, the
extracted information is transformed into a structured form for further use. Web mining is the application
of data mining techniques to automatically discover potentially useful information from the web.

As shown in Figure 2, web mining is divided into three categories: web content mining, web
structure mining and web usage mining. Web content mining is about discovering useful content on the
World Wide Web (WWW) by using data integration and data extraction. Web structure mining places websites
and web pages within a network of connected websites by using hyperlinks. A hyperlink is an
element in an HTML document that links an object, such as text, an image or a video, to another HTML
document. On the other hand, web usage mining focuses on browsing behavior, using either pattern tracking
or personalized usage tracking. This paper focuses on web multimedia mining, specifically on images.
Figure 2. Web mining categories
Mining or extracting data from web pages is of major benefit to people. Websites are designed
for a variety of audiences, and their content is known as semi-structured data. The structure
differs from one web page to another. Thus, it is not easy to capture all the data across different
structures [5]; many studies discuss extracting data from websites, and various methods have been
developed. The large volume of images and their associated information requires new solutions to manage
and analyze them. We have proposed wrapper extraction of image using document object model (DOM) and
JavaScript object notation data (JSON) (WEIDJ) to address this concern. The main motivation for this
research is the extraction of images, the mining of image details and their storage in a single multimedia
database. In a typical scenario, an image that needs to be saved must be extracted manually. Extracting
and saving the required files or images is important because these documents can be useful for further
purposes. However, loading times become a problem when the images to be extracted are too large.
Therefore, another solution must be developed to extract the images automatically and reduce the time
consumed. A data extraction engine should be able to extract all the required data from a web page. The
initial step in extracting data from a specific web page is to define the uniform resource locator (URL)
of the web page where the data is located.


2. DATA EXTRACTION 

Data extraction is the process in which data is crawled and analysed from data sources such as the web
or databases. It depends on specific patterns in the user requirements. The goal of data extraction is to
retrieve relevant information. It organizes data into a usable and valuable resource for further
purposes. The extraction process may involve different data types. Prior to extraction, data needs to
be well organized; data in a structured format is easier to work with. There are three types of
data: structured, semi-structured and unstructured. There are many ways to deal with each of these
types. This research focuses on the extraction of semi-structured data. There are three basic steps in
the data extraction process, as shown in Figure 3.

The advantage of data extraction from semi-structured data is that it can be applied in various fields,
such as education [6], advertising [7] and housing management [8]. In former works, data
extraction has been modelled using a single model or a combination of several models for an optimum
assessment [9]. While the web has developed into a large source of information, it holds different types
of data, which are discussed in the next section. This paper aims to advocate the potential of a two-phase
query paradigm for web mining. Our experiments follow these criteria:

- Having an explicit target for the extraction process. 

- Using a large set of information from several websites that have different structures.

This approach turns out to be highly effective in practice. In our view, these results hint that a fully
automatic solution for querying structured images and their related information is feasible, provided that
it accounts for images that cannot be retrieved (Figure 4), the structure of each website and the
redundancy of images (Figure 5).
Figure 3. Data extraction process
Figure 4. Images cannot be retrieved
Figure 5. Redundancy of images


3. WEB DATA EXTRACTION (WDE)

The importance of web data extraction (WDE) stems from the fact that large amounts of data are
continuously being generated, shared and utilized every second. WDE techniques are implemented to
reduce labor-intensive tasks and play an important role in raising the accuracy of data extraction. Many
factors, including the choice of techniques, should be considered in designing WDE. One critical factor is
the ability of the developed techniques to process a large amount of data in a short time.

A web data extraction system is a software application that can extract data from web sources [10].
This application usually interacts with a web source and extracts the stored data. The extracted contents
consist of elements in the HTML web pages and can be post-processed, converted to the most appropriate
structured format and stored for further usage. Table 1 shows web data extraction tools that use
different techniques.


Table 1. Web data extraction tools

(Author, year)                   Tools                   Model
Fang, Xie [11]                   STEM                    Suffix Tree Based Method
Pouramini, Khaje Hassani [12]    Handle-based Wrapper    DOM Tree
Jiménez and Corchuelo [13]       TANGO                   DOM
Chitra and Aysha Banu [14]       DWDE                    Tag based Feature
Tripathy, Joshi [15]             VEDD                    DOM Tree Breadth First Search (BFS)
Derouiche, Cautis [16]           ObjectRunner
Liu, Pu [17]                     XWRAP                   DOM Tree
Chang and Kuo [18]               OLERA
Liu, Grossman [19]               MDR
Cai, Yu [20]                     VIPS                    DOM Tree Visual Cues
Crescenzi, Mecca [21]            Road Runner             -
Chang and Lui [22]               IEPAD                   Pattern Discovery
Hsu and Dung [23]                SoftMealy               -
Hammer, Garcia-Molina [24]       TSIMMIS                 Object Exchange Model (OEM)


4. RESEARCH METHOD

Prior to extraction, data needs to be well organized; data in a structured format is easier to work
with. There are three types of data: structured, semi-structured and unstructured. There are many ways
to deal with each of these types. This research focuses on the extraction of semi-structured data. There
are three basic steps in the data extraction process: selection, transformation and
knowledge [25]. A web wrapper is a procedure that is implemented based on any of the specified algorithms.
Its goal is to seek and find the data required by human users from web sources, which include unstructured
or semi-structured data. The extracted data is transformed into a structured representation for further
usage. Lately, the problem of extracting information from unknown sites, focusing on unstructured or
semi-structured data, has been receiving much attention from researchers, and the work on WDE has been
reviewed extensively. This section discusses our proposed method, WEIDJ.

WEIDJ is developed to assist users in extracting semi-structured data from web pages. A web page
can be represented by a DOM tree structure. WEIDJ converts a given web page address from a search
engine into a DOM tree and stores it [26]. Recently, the extraction process has focused on images [27, 28].
When the user inputs the uniform resource locator (URL) and the query is submitted to a search engine, the
search engine dynamically generates a result page containing the result records. The results consist of
the link path for each image element, the image itself, the size of the image and the processing time
needed to load each image [29]. WEIDJ uses AJAX technology to extract data from web sources. AJAX, the
abbreviation of Asynchronous JavaScript and XML, is a set of web development techniques that allows a web
page to update portions of its content without having to refresh the page. AJAX represents a concept
similar to client-server development. In a client-server application, the amount of data transferred is
kept minimal by sending only the necessary data back and forth. Similarly, with AJAX, only the
necessary data is transferred back and forth between the client and the web server. This minimizes
network utilization and processing on the client, which reduces the time for the extraction process.
Figure 6 shows an overview of WEIDJ using AJAX and JSON data.


[Diagram labels: Web browser; Ajax "engine" (JavaScript); HTTP request; JSON data; Web server;
Multimedia database; Server-based system]

Figure 6. Overview of WEIDJ model
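
As a rough illustration of the JSON data exchanged between the Ajax engine and the web server, the
sketch below serializes per-image records (link path, size and load time, the fields listed in the text)
to JSON and parses them back. The class name ImageRecord, the field names and the sample values are
assumptions made for illustration only; the paper does not specify the payload format.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class ImageRecord:
    # Hypothetical field names; the paper only lists path, size and load time.
    path: str           # link path of the image element
    size_bytes: int     # size of the image
    load_time_s: float  # time taken to load the image


def to_json_payload(records):
    """Serialize image records into a compact JSON string."""
    return json.dumps([asdict(r) for r in records], separators=(",", ":"))


def from_json_payload(payload):
    """Parse a JSON payload back into ImageRecord objects."""
    return [ImageRecord(**item) for item in json.loads(payload)]


# Illustrative values only, not measurements from the paper.
records = [
    ImageRecord("/img/a.jpg", 20480, 0.12),
    ImageRecord("/img/b.png", 10240, 0.08),
]
payload = to_json_payload(records)
restored = from_json_payload(payload)
```

Only the fields needed by the receiver are carried in the payload, which mirrors the point made above
that AJAX transfers just the necessary data between client and server.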


It can be difficult to create extraction rules that properly describe the required data. In this paper,
we propose the WEIDJ [30] model to extract images from a web page. The work described in this section uses
a combination of two techniques, DOM and JSON [31]. In addition, we check the images block by
block in the HTML documents. The model also focuses on arranging the extracted data in a tabular format.
Many applications focus on extracting information and then arranging it accordingly [32, 33]. Every web
page has its own structure, which includes the main topic, related topics, additional information,
advertisements, contact information, images, and audio and video files. The problem that we want to solve
is finding the best technique for extracting images automatically [34, 35]. Mining information records in
data regions plays an important role in defining the tags of semi-structured data. It is easy to extract
data from data regions, also known as data areas, because they contain the useful data. A technique is
required in order to mine the data areas. In the earlier stage, this model used the DOM tree as the base
technique to mine data regions in a web page.
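
The block-by-block check of images described above can be sketched as follows. This is a minimal
illustration, not the WEIDJ implementation: it uses the event-based parser from the Python standard
library rather than a full DOM tree, and a stack of open block tags stands in for the DOM ancestor
chain. The class name ImageCollector, the choice of block tags and the sample HTML are assumptions.

```python
from html.parser import HTMLParser


class ImageCollector(HTMLParser):
    """Collect <img> sources together with their nearest enclosing block tag."""

    # Assumed set of block-level containers; adjust for the pages at hand.
    BLOCK_TAGS = {"div", "section", "article", "table", "ul"}

    def __init__(self):
        super().__init__()
        self.block_stack = []  # currently open block-level ancestors
        self.images = []       # (enclosing block tag or None, src) pairs

    def handle_starttag(self, tag, attrs):
        if tag in self.BLOCK_TAGS:
            self.block_stack.append(tag)
        elif tag == "img":
            src = dict(attrs).get("src")
            if src:
                block = self.block_stack[-1] if self.block_stack else None
                self.images.append((block, src))

    def handle_endtag(self, tag):
        if tag in self.BLOCK_TAGS and tag in self.block_stack:
            # Pop back to the innermost matching open block.
            while self.block_stack.pop() != tag:
                pass


html = """
<div><img src="logo.png"><ul><li><img src="a.jpg"/></li></ul></div>
<img src="banner.gif">
"""
collector = ImageCollector()
collector.feed(html)
images = collector.images
```

Grouping images by their enclosing block is one simple way to tell images inside a data region apart
from page-level images such as banners.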


4.1. WEIDJ algorithm 

In industry, big data systems index large volumes of data, and images are often the data of
preference. Many other researchers work on data extraction from different sites in order to test
extraction performance. In this research work, we retrieve images and their information from web sources
to be analysed for further usage. We propose a mediator tool called WEIDJ. This tool aims to extract
images according to a uniform resource locator (URL), mine the image details and present them in a
structured format before storing them in a multimedia database.

In WDE, a web-based method based on the DOM is applied. The DOM provides a structured way to
describe documents. HTML documents are converted into a DOM tree structure, in which each element is
known as a node. The main data pre-processing task in web data extraction is to pre-build the DOM tree
of the web page. The wrapper analyses specific targets among Internet sources, namely websites. First, it
obtains the relative URLs from a website. Each URL may contain a few, hundreds or thousands of images.
Information is extracted from different levels of web pages, such as a single web page, web pages from
different sources and the deep web. Following the extraction rules, the extraction of information needs to
include page refinement to clean the page and extract useful information such as the images, the paths of
the images, the sizes of the images and so forth. The wrapper is proposed to extract images from the web.
In this way, the images are converted into a form suitable for computer processing, represented by the
extracted data in tabular format. This representation is important for the research analysis
of data extraction. Figure 7 describes the whole process of realizing WDE.
Figure 7. The process of the image's extraction
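
Two of the steps described above, resolving relative URLs and arranging the extracted images in tabular
form, can be sketched with the Python standard library. This is a minimal sketch under stated
assumptions: the function name build_image_table, the column names, the example.org URLs and the choice
to drop repeated images by their resolved URL are all illustrative and are not taken from the paper.

```python
import posixpath
from urllib.parse import urljoin, urlparse


def build_image_table(page_url, srcs):
    """Resolve image sources against the page URL and drop repeated images."""
    rows, seen = [], set()
    for src in srcs:
        absolute = urljoin(page_url, src)  # relative path -> absolute URL
        if absolute in seen:               # skip redundant images
            continue
        seen.add(absolute)
        filename = posixpath.basename(urlparse(absolute).path)
        rows.append({"no": len(rows) + 1, "filename": filename, "url": absolute})
    return rows


# Hypothetical page and image paths, for illustration only.
rows = build_image_table(
    "http://example.org/gallery/index.html",
    ["a.jpg", "/img/b.png", "a.jpg", "../c.gif"],
)
```

Each row can then be extended with further columns, such as image size or load time, before it is
stored.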


5. RESULTS AND ANALYSIS 

In this research work, web data extraction experiments were set up to compare the performance
of WEIDJ with existing methods. The software configuration used in the experiments is described in
previous work [35]. The findings in Figure 8 show that as the number of extracted images increases, the
time taken by the two retrieval methods DOM and WHDJ (wrapper hybrid DOM and JSON) increases,
whereas the image extraction time of WEIDJ is significantly lower than that of the other approaches. Five
different websites from the biodiversity field were selected to test the performance of web data
extraction, as shown in Table 2. Each website has a different data volume and data size. In the web data
extraction experiment, the different data volumes and data sizes were tested with four different
extraction methods.

This paper also selects the FangJia website, http://sh.FangJia.com, as shown in Figure 9.
This website was selected as a benchmark because findings on image extraction from it have already been
discussed [27]. Four typical data extraction algorithms, VIBS, MDR, DEPTA and
VIDE, were selected for comparison. The experiments were conducted on prototype systems of these
algorithms. Two types of performance measurement were carried out in this
experiment: the first is execution time analysis and the second is precision, recall and F-measure.


Figure 8. Performance of image extraction for deep web (chart title: Performance of Extracting Images
for Web Crawler (Time Extraction))


Table 2. Characteristics of instant dataset

No.  Uniform Resource Locator (URL)                    Domain
General Biodiversity and Endangered Species Information
1    http://www.amnh.org/                              American Museum of Natural History (AMNH) Hall of Biodiversity
2    http://ocean.si.edu/                              Ocean Portal: Smithsonian Institution
3    http://www.iucn.org/                              International Union for Conservation of Nature
4    http://www.endangeredspeciesinternational.org     Endangered Species International
5    http://www.wwf.m                                  World Wide Fund for Nature
Figure 9. Structured pages from FangJia.com


5.1. Time extraction analysis 

In this experiment, 40 pages from the same FangJia website (https://fangjia.fang.com/bj/)
were selected randomly. The extraction time is measured from the beginning of the extracted
page to the next page. Figure 10 shows the sample output for extracting the 40 pages, page by page. The
duration of the extraction process is shown in detail in Table 3. The performance analysis shows that
VIBS is the fastest at extracting images for the first 5 and 10 pages, but as the HTML documents
become larger, from 15 pages onwards, WEIDJ clearly outperforms the existing tools.


5.2. Precision, recall and F-measure 
According to [27], the interference of web page noise with data extraction is important to
consider, in addition to efficiency and accuracy across heterogeneous deep web pages. This issue motivates
us to improve the existing algorithms with respect to noisy information. So, besides focusing on
extraction time, we also want to extract the significant information about each image and
remove the noisy information. Table 4 shows the results of the experimental evaluation of WEIDJ using
the FangJia web page as the tested URL. Figure 11 compares the five algorithms in the experiments.
Our model, WEIDJ, has shown that it can extract data as accurately as VIBS. This accuracy is
achieved because of two factors included in this research: noise filtering and the use of
JSON, which helps to transform the data faster.

WEIDJ: Development of a new algorithm for semi-structured web data...(Ily Amalina Ahmad Sabri)
Figure 10. Extracting 40 pages


Table 3. The performance of data extraction

Method   Time Extraction
         5 pages   10 pages   15 pages   20 pages   25 pages   30 pages   35 pages   40 pages
WEIDJ    12.6972   18.639     22.18      29.1468    29.5079    35.2651    37.977     48.8498
VIBS     7.25      12.7       23.74      30         35.01      44.37      49.76      62.69
MDR      19.29     40.11      61.18      83.78      101.07     122.63     148.33     164.16
DEPTA    20.98     43.79      66.66      90.63      114.04     135.72     153.55     180.71
VIDE     53.13     94.37      144.33     195.23     246.29     291.08     341.18     389.52
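
The comparison in Table 3 can be checked with a few lines of code. The sketch below copies the timings
from Table 3 and computes the first page count at which WEIDJ is faster than every other method, along
with the relative time saving against a chosen baseline. The function names are hypothetical, and the
saving is defined here as 1 - t_method / t_baseline, which is an assumption rather than a measure defined
in the paper.

```python
PAGES = [5, 10, 15, 20, 25, 30, 35, 40]

# Extraction times copied from Table 3.
TIMES = {
    "WEIDJ": [12.6972, 18.639, 22.18, 29.1468, 29.5079, 35.2651, 37.977, 48.8498],
    "VIBS": [7.25, 12.7, 23.74, 30, 35.01, 44.37, 49.76, 62.69],
    "MDR": [19.29, 40.11, 61.18, 83.78, 101.07, 122.63, 148.33, 164.16],
    "DEPTA": [20.98, 43.79, 66.66, 90.63, 114.04, 135.72, 153.55, 180.71],
    "VIDE": [53.13, 94.37, 144.33, 195.23, 246.29, 291.08, 341.18, 389.52],
}


def first_page_count_where_fastest(method):
    """Return the smallest page count at which `method` beats all other methods."""
    for i, pages in enumerate(PAGES):
        others = (t[i] for name, t in TIMES.items() if name != method)
        if all(TIMES[method][i] < other for other in others):
            return pages
    return None


def relative_saving(method, baseline, pages):
    """Fraction of the baseline's time saved by `method` at a given page count."""
    i = PAGES.index(pages)
    return 1 - TIMES[method][i] / TIMES[baseline][i]


crossover = first_page_count_where_fastest("WEIDJ")
saving_vs_vibs = relative_saving("WEIDJ", "VIBS", 40)
saving_vs_vide = relative_saving("WEIDJ", "VIDE", 40)
```

With these figures, WEIDJ first becomes the fastest method at 15 pages, and at 40 pages it saves roughly
22% of the time taken by VIBS and roughly 87% of the time taken by VIDE.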


Figure 11. Comparison performance existing method (chart title: Comparison of Performance with Other
Methods)
Table 4. Result of the experimental evaluation for WEIDJ

Total img   Data retrieved   Data (False)   Unknown Data   Precision   Recall     F1
145         142              0              3              100         97.93103   98.9547
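
The figures in Table 4 are consistent with the standard definitions of precision, recall and F1 when
"Data retrieved" is read as the number of retrieved images and "Data (False)" as the number of incorrect
retrievals. That reading of the column names is an assumption, as is the function name evaluate; the
sketch below simply reproduces the arithmetic.

```python
def evaluate(total, retrieved, false_hits):
    """Return precision, recall and F1 as percentages."""
    correct = retrieved - false_hits          # correctly retrieved images
    precision = 100.0 * correct / retrieved   # share of retrieved images that are correct
    recall = 100.0 * correct / total          # share of all images that were retrieved
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts taken from Table 4.
precision, recall, f1 = evaluate(total=145, retrieved=142, false_hits=0)
```

The three images listed as "Unknown Data" account for the gap between the 145 images in total and the
142 retrieved, which is why recall falls below 100 while precision does not.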


6. CONCLUSION 

The World Wide Web has become a vast information store that is growing at a rapid rate, both
in the number of sites and in the volume of useful information. WDE is time consuming when HTML documents
become larger. A single DOM technique did not perform well in extracting multimedia data such as images
as the volume of data increased. However, when another technique, JavaScript object notation, was
implemented in an enhanced model, namely wrapper hybrid DOM and JSON (WHDJ), the execution time for
extracting images and their information improved by 50% over the DOM technique. Although the execution
time improved, the limitation of this model is the redundancy of similar filenames in image
extraction. To address this, we intend to combine both approaches and apply visual segmentation to
get the best performance and extract the useful images. The wrapper has been developed based on the
proposed model, WEIDJ. The finding that the execution time of WEIDJ is 90% better than that of existing
tools should be interpreted in light of the page level of extraction, the deep web, that was used in the
execution time experiments. In this study, the benchmark dataset (FangJia) and the biodiversity
websites were heterogeneous with respect to images, image paths, image sizes and execution time.
Besides execution time, which was the main criterion, the image extraction experiments also
improved the validity of significant information by removing noisy information about the images. In
future studies, it is recommended that the selected datasets cover a variety of fields, including social
networks and other platforms, because websites are built with different technologies.